Math 220 Notes

Hi! This is RStudio!

2+2
## [1] 4

Week 1

Wed Jan 18

Data Basics

Many formats for data!

Here (and in all DA classes), we only care about RECTANGULAR DATA (“tidy data”):

3 things:

  • rows are CASES (individuals). Ie, from what person/animal/thing/whatev did we record? Eg: you guys! remember: rows are horizontal!
  • columns are VARIABLES. Ie, the qualities of the individuals that we care about! columns are vertical!
  • each “cell” must contain ONLY one observation/measurement

Types of Data

Big breakdown: numerical (quantitative) or categorical (qualitative) ie: number or not!

All together, 4 types:

  • discrete numerical data. “countable”. Actually counting real objects/things/ideas.

  • continuous numerical data. “measurable”. Eg: weight, volume, duration of time, etc.

     Philosophical note:  be careful about WHAT you're measuring.  Ex:
     Age = integer number of years.  discrete.
     Age = duration of time living on earth.  continuous.  
     height = measuring tape.  continuous.  Possible to be 66.1", 66.7", 69.23423423"
    
     Weird example:  money.  gas?  pennies?  currency conversions? interest!
           Class agreement:  money is continuous. 
  • nominal categorical. categorical, no INTRINSIC order.

  • ordinal categorical. categorical, YES natural order

MORAL OF THE STORY: Not all data is the same. We use DIFFERENT TOOLS for different types!

Saying smart things about data (descriptive/summary stats)

Summary statistics for Categorical data: proportion!

Ex: In our class, 9/15 students had black hair.

phat:

9/15
## [1] 0.6

Summary statistics for Quantitative data:

For quant variables, always describe:

  1. center.
  2. spread.
  3. shape.

Friday Jan 20

Quantitative Variables (saying something smart)

Note: it doesn’t make sense to talk about ONE number in a dataset. Instead, we care about ALL of them, at the same time! “Distribution”.

ALWAYS MAKE A PICTURE FIRST!

For numerical data: HISTOGRAM!

Ex: class data:

head(ourData)
##            X Year.in.School   Major Number.of.Siblings       Name.of.Hometown
## 1 Student 01      Sophomore      DA                  2                Chicago
## 2 Student 02      Sophomore Physics                  1           Hilliard, OH
## 3 Student 03      Sophomore      DA                  3       Hillsborough, CA
## 4 Student 04         Junior     Bio                  1       Vungtau, Vietnam
## 5 Student 05      Sophomore    Econ                  1          Lakeville, MN
## 6 Student 06      Sophomore      CS                  1 Buon Ma thuot, Vietnam
##   Population.of.Hometown Hair.Color Age Height..inches.
## 1              2,700,000      Brown  20              74
## 2                 37,114      Black  20              70
## 3                 11,000      Brown  20              70
## 4                454,000      Black  20              65
## 5                 72,000      Brown  19              70
## 6                502,175      Black  21              69
##     Favorite.Childhood.Activity
## 1                        Sports
## 2                   video games
## 3                       Sailing
## 4                   video games
## 5                Playing sports
## 6 playing soccer, chess, boxing

Q: what can we say about your height?

Hist!

hist(ourData$Height..inches.)

Observations:
- typical height is around 67/68
- spread: looks like “most” values are btwn 62 - 72
- shape: looks symmetrical “ish”

Q: what if we use more than 5 rectangles?

hist(ourData$Height..inches., breaks = 10)

Interesting new feature! Bimodal! Q: could it be because of sex?

MORAL OF THE STORY: Follow the “Tao of the Histogram”!

Ex: our siblings

hist(ourData$Number.of.Siblings, breaks = c(0,.99,1.99,2.99, 3.99))

Tao of the histogram!

Summary Statistics

Ie, numerical summary of distribution.

  • Measures of Center
    • Mean: average! “balancing point”
    • Median: “middle number”
    • Mode: most common number
  • Measures of Spread
    • Range = max - min. Con: highly susceptible to outliers.

    • Standard deviation = “average distance btwn points and the mean” Formula: on the board Note: must square to measure the TOTAL variation!

    • IQR. Quartiles: cutoffs for 25% (Q1) and 75% (Q3) in the dataset. IQR = Q3 - Q1

      Ex: Data = {15, 16, 16, 18, 20, 27}

      Compute IQR = 20 - 16 = 4

      (M = 17, Q1 = 16, Q3 = 20)

      Meaning: range of the MIDDLE 50% of the data!

A statistic is ROBUST if it is NOT strongly affected by skew or outliers.

Ex: Data = {1,1,1,1,1,19}

Mean = 4, Med = 1. Seems like median is better!
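Quick check in r:

```r
x <- c(1, 1, 1, 1, 1, 19)
mean(x)     # 4, dragged up by the outlier
median(x)   # 1, unaffected (robust!)
```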

Week 2

Wed Jan 25

Distribution: The salaries of all employees at a major corporation.

Which would provide a rosier picture of compensation: the mean or the median?

A: expect strong right skew (tail)! expect mean to be bigger (since not robust).

Outliers

Q: what “counts” as an outlier? Need a test!

Q: There are two sections of calculus. Both take the same exam, both class means are the same.

Is it possible that a student who scores 95 in one section could be an outlier, but a student with same score in the other is NOT an outlier?

Yes! Must factor in the spread!

Let’s use IQR! (robust!)

“Cutoffs” for outliers:

low = Q1 - 1.5*IQR = Q1 - 1.5(Q3-Q1)
hi  = Q3 + 1.5*IQR = Q3 + 1.5(Q3-Q1)

Ex: Data = { 1, 2, 3, 4, 5, 6, 7, 8, 8, 20}

Q: are there outliers?

M = 5.5 Q1 = 3, Q3 = 8, IQR = 8-3 = 5

Cutoffs:

low:

3 - 1.5*5
## [1] -4.5

hi:

8 + 1.5*5
## [1] 15.5
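The whole test in r. (Caution: R's default quantile rule differs slightly from the hand method; `type = 2` happens to match the class convention for this dataset.)

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 8, 20)
q1 <- quantile(x, .25, type = 2)   # 3, matching the hand computation
q3 <- quantile(x, .75, type = 2)   # 8
iqr <- q3 - q1                     # 5
c(lo = q1 - 1.5 * iqr, hi = q3 + 1.5 * iqr)   # -4.5 and 15.5
x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]    # 20 is the only outlier
```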

Boxplots

Great for comparison!

Ex: iris dataset

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Ex: sepal length

boxplot(iris$Sepal.Length, horizontal = T)

Q: compare species?

boxplot(iris$Sepal.Length ~ iris$Species, horizontal = T)

Ch 2 - Probability!

Plan: stay close to the book!

Key building blocks:

A random process / chance experiment is any scenario where more than one unknown outcome could occur!

Ex: Tossing a coin, rolling dice, sampling random student

Sample space = S = the set of all possible outcomes
Event = A = any subset of the sample space (what we care about!)

Ex: Rolling a die. Let A be the event that we get an even number. Let B be the event that we get a prime.

S = {1,2,3,4,5,6}
A = {2,4,6}
B = {2,3,5}

goal: combine events into useful statements

3 tools:

  • Complement. “not”. A’ is everything in S, NOT in A
  • Union. A U B = all elements in either A OR B (could be both)
  • Intersection A intersect B = all elements in both A AND B

Find:

A’ = {1,3,5}
A U B = {2,3,4,5,6}
A intersect B = {2}
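These set operations exist in r, too:

```r
S <- 1:6
A <- c(2, 4, 6)   # even
B <- c(2, 3, 5)   # prime
setdiff(S, A)     # complement: A' = {1, 3, 5}
union(A, B)       # {2, 3, 4, 5, 6}
intersect(A, B)   # {2}
```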

Friday Jan 27

2.2 Axioms of probability

The very basics!

For some event A, the probability of event A, denoted P(A), satisfies…

1). P(A) >= 0 (for any event A in our sample space S)
2). P(S) = 1. (since S is all possibilities)
3). If A, B are mutually exclusive, then:

        P(A U B) = P(A) + P(B)

Ex) Using only the axioms, prove that P(A’) = 1 - P(A)

Ex) Axioms, prove P(A) <= 1 for any A

Ex). P(A U B) = P(A) + P(B) - P(A and B)

Exercise: prove from axioms! (it’s in the book!)

Ex). Cards! If we draw at random, what’s the prob that our card is a heart or a face?

P(H U F) = P(H) + P(F) - P(H and F)

        = 13/52 + 12/52 - 3/52 = 22/52
        

2.3 - Counting Techniques

The PRODUCT RULE: if we make a sequence of k decisions,
  • with n1 options for the first,
  • n2 options for the second, …
  • nk options for the kth decision.

If the decision at any one stage doesn’t affect any other, then there are:

     n = n1 * n2 * n3 * ... * nk total possibilities
     

Ex). We toss a coin 3 times. How many possible outcomes? Ex: H H H

Ex). Roll 2 dice. How many outcomes?

      6 * 6 = 36
    

Ex). In OH, license plates have 3 letters followed by 4 numbers.

How many plates are possible?

26^3*10^4
## [1] 175760000

Ex). Above, but no letter “O”, can’t end on an odd number.

Permutations

Ex) How many arrangements of the letters in ALICE are there?

Ex) How many arrangements of the letters in BANANA are there?

Ex) There are 20 children. 3 swings. How many seating arrangements are there?

Combinations

Ex). There are 20 children. We need to pick a team of 3 children. How many teams?

Problem: teams have no arrangement/order!

What’s the difference? ORDER!!!

Perms - order matters!

Combs - doesn’t
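The board examples above (ALICE, BANANA, swings, teams) can be checked with factorial() and choose():

```r
factorial(5)                                   # ALICE: 5! = 120 arrangements
factorial(6) / (factorial(3) * factorial(2))   # BANANA: repeated letters, 60
20 * 19 * 18                                   # swings: order matters, 6840
choose(20, 3)                                  # teams: order doesn't, 1140
```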

Week 3

Monday Jan 30

Ex). In a tennis league, there are 16 players. How many possible matchings are there?

Ex2). Same league. After a competition, someone gets 1st place, 2nd place, 3rd place. How many possible awards brackets?

Ex3). We toss 8 coins. What’s the probability of getting:

       H H H H T T T T
       
       H T H T H T H T
       

Ex4). We toss 8 coins. What’s the probability of getting exactly 4 heads?

P(Exactly 4 heads) = 70/2^8

70/2^8
## [1] 0.2734375

Ex5) We toss 10 coins. What’s the probability of exactly 3 Heads?

Future: binomial dist!
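A peek at that future: dbinom() packages this exact counting argument.

```r
choose(8, 4) / 2^8                 # exactly 4 heads in 8 tosses = 70/256
dbinom(4, size = 8, prob = 0.5)    # same answer
dbinom(3, size = 10, prob = 0.5)   # Ex5: exactly 3 heads in 10 tosses
```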


Alternate version:

Ex3). Roll a die 10 times. What’s the probability that the first 3 rolls show “4”, and no others do?

Ex: 4 4 4 1 2 3 2 1 5 6

Ex4). What’s the prob that the LAST 3 rolls show “4”?

Same thing!

Ex5). What’s the prob that EXACTLY 3 of the 10 rolls shows “4”?

Future: binomial distribution


Card Example

Ex6). Cards.

Draw a 5-card hand (from 52 card deck).

a). How many possible hands.

choose(52,5)
## [1] 2598960
b). What’s the probability of getting one pair?

    Ex: 3 3 J A 7

    NO: 3 3 3 A 7

    No: 3 3 3 3 K

    NO: 3 3 J J K

Decisions:

- decide which kind for the pair?  13 C 1

- decide which for the pair?   4 C 2

- decide kinds for other 3 cards:  12 C 3

- for each of the three: which suit?  4C1 * 4C1 * 4C1

Final: 13C1 * 4C2 * 12C3 * 4C1 * 4C1 * 4C1

Practice: wikipedia! Google: “wiki poker probabilities”
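The final count, evaluated in r:

```r
p_pair <- choose(13, 1) * choose(4, 2) * choose(12, 3) * choose(4, 1)^3 / choose(52, 5)
p_pair   # about 0.42
```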

2.4 Conditional Probability

Idea: if it’s cloudy, more likely to rain.

  P(A | B) = "the probability that A happens, GIVEN that B has occurred"
  
           = P(A and B) / P(B)
           
           

Ex) At DU, 52% of students are female, 48% male. About 8.2% of all students are female STEM majors.

If we select a student at random, and that student is female, what’s the probability that she’s a STEM major?

given: female!

Need: P(S | F) = P(S and F)/P(F) = .082/.52

Ex) Lie detector test. It advertises itself as 95% accurate. This means that IF a person is lying, then there’s a 95% chance of a + result.

However, we also know that if a person is NOT lying, there’s a 90% chance the test shows -.

Suppose that most people (99%) are honest.

a). Draw a probability tree to summarize.

b). What’s the prob a rando person tests “+”?

Only options: L and +, or. L’ and +

.01*.95 + .99*.1
## [1] 0.1085

This is “Total Prob!”. Breaking it down into cases.

c). If the test says you’re lying, what’s the prob that you’re actually lying?

P(L | + ) = P( L and + )/P(+)

Tree:

.01*.95/(.01*.95 + .99*.1)
## [1] 0.0875576

If you get +, there’s only about an 8.8% chance that you really lied!

Q: Why so small?

2.5 Independence

Idea: sometimes, one outcome DOES NOT AFFECT another!

Independence!

Ex: coin tosses.

Math definition:

We say events A,B are INDEPENDENT if:

     P(A | B ) = P(A)
     
     P(B | A ) = P(B)

if one is true, so is the other.

Ex: poker and snacks

P(P|C)

12/30
## [1] 0.4

P(P)

25/115
## [1] 0.2173913

Q: are these different enough to suggest not independent?

Ex). Text (Example 2.36)

Friday Feb 3

Independence:

 P(A | B) = P(A)
 P(B | A) = P(B)
 

EASY WAY TO CHECK:

     P(A and B) = P(A)*P(B)
     

Ex) Cards! Are the events “draw a heart” and “draw a face card” independent?

Justify mathematically.

  Check:    P(H and F) =? P(H)*P(F)
  
                3/52   =?  13/52 * 12/52
                
3/52
## [1] 0.05769231
13/52 * 12/52
## [1] 0.05769231

Independent events!

Ch 3.1 - 3.2 Probability Distributions

Idea: think more holistically about probability.

RANDOM VARIABLE: map each outcome in the sample space to a real number.

Ex1. Roll dice. Let X = the number that shows. Possible values for X = {1,2,3,4,5,6}

Ex2. Roll two dice. Let X = their sum. values for X = {2,3,4,…,11,12}

Ex3. Toss a coin repeatedly until get H. Then stop. Let X = the number of coin tosses. {1, 2, 3, …. }

A discrete PROBABILITY DISTRIBUTION for a RV X is:

  • a list of all possible values for X
  • for each, an associated probability.

Notes:

  • the SUPPORT for the distribution is the set of all x with nonzero prob (eg, {1,2,3,4,5,6} for the die)

  • 0 <= P(x) <= 1 for all x in support

  • sum( P(x) ) = 1

Ex1-3) Construct distributions for the RVs listed above. [Solved on the board].
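A sketch of Ex2’s table in r: enumerate all 36 ordered rolls and tabulate the sums.

```r
sums <- outer(1:6, 1:6, "+")   # all 36 ordered outcomes of two dice
table(sums) / 36               # the distribution of the sum: 2 thru 12
```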

Cumulative distributions

CDF - cumulative distribution function

on board

Notes about CDF:

  • every cdf is defined on ENTIRE REAL LINE
  • left hand limit = ____
  • right hand limit = ____

Why cdf?

A: very useful for RANGES of options for X

Ie:

     P(a <= X <= b) = F(b) - F(a-)     <- book, p106
     
     

Ex: What’s the prob we roll a number btwn 2 and 5 (inclusive)?

    P(2 <= X <= 5) = F(5) - F(1) = 5/6 - 1/6 = 4/6
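The same computation via cumsum:

```r
F <- cumsum(rep(1/6, 6))   # cdf of one die at x = 1..6
F[5] - F[1]                # P(2 <= X <= 5) = 4/6
```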
    
    

Week 4

Monday Feb 6

3.3 Expected Value and Variance for Discrete RV

Before: described data that happenED

Now: expectations for the future! prob!

Expected Value

E[X] = sum( x*P(X=x) )

Ex). Dice Distro.

1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6
## [1] 3.5

If we rolled the dice MANY times, we’d expect an average close to 3.5.

Variance

V[X] = E[ (X-mu)^2 ] = sum( (x-mu)^2 * P(X=x) )

Ex) Find V for dice distro.

V[X]

(1-3.5)^2*1/6 + (2-3.5)^2*1/6 + (3-3.5)^2*1/6 + (4-3.5)^2*1/6 + (5-3.5)^2*1/6 + (6-3.5)^2*1/6
## [1] 2.916667

SD[X]

sqrt(2.91666666)
## [1] 1.707825

Linear Transformations of RVs

 E[ h(X) ] = sum( h(x)*P(X=x) )
 
 E[ X^2 ] = sum( x^2 * P(X=x) )
 

Linear trans: (p115)

 Y = aX1 + bX2 + c
 
 where X1, X2 are RVs, and a,b,c are constants
 
 then:
 
 E[Y] = a * E[X1] + b * E[X2] + c
 
 V[Y] = a^2 * V[X1] + b^2 * V[X2] 
 
 SD[Y] = sqrt(V[Y]).  (WARNING:  must find V first!)

Ex). A grad program values the math portion of entrance exam twice as much as the verbal.

Transf:

 Y = 2*X1 + 1*X2
 
     where X1 = math score, X2 = verbal
     

Info:

math:   mean = 120,  stdev = 5
verb:   mean = 160,  stdev = 7

Q: What are the mean and stdev of weighted score?

E[Y] = 2*E[X1] + 1*E[X2] = 2*120 + 1*160

2*120 + 1*160
## [1] 400

V[Y] = 2^2*V[X1] + 1^2*V[X2] = 2^2 * 5^2 + 1^2 * 7^2

2^2 * 5^2 + 1^2 * 7^2
## [1] 149

stdev:

sqrt(149)
## [1] 12.20656

Note: must get V first!

3.4-3.6 (Skip 3.5) Meet the Famous Prob Distros!

1). Bernoulli

    X ~ bernoulli(p) [NOTE: p is a “parameter” ]

Only two outcomes: X=0 (fail) or X=1 (success).

P(X = 1) = p       (prob of success)
P(X = 0) = 1-p

E[X] = p
V[X] = p(1-p)       <- check later!
2). Binomial Dist

    X ~ binom(n,p)

ASSUMPTIONS:

  • there are a fixed number (n) of trials
  • only two outcomes for each trial: success or fail (bernoulli!)
  • all trials are independent
  • CONSTANT probability of success (p) in each trial

Ex). If we roll a die 10 times, what’s the prob that exactly 3 of them show “4”?

X~binom(10,1/6)
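Evaluating with dbinom (same formula as the by-hand version):

```r
dbinom(3, size = 10, prob = 1/6)          # about 0.155
choose(10, 3) * (1/6)^3 * (5/6)^7         # same thing by hand
```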

Wed Feb 8

Famous Distros

1). Bernoulli

    Support: {0,1}

2). Binomial

X~binom(n,p)

X counts the number of successes out of n trials.

Ex). At DU, about 80% of students are from out-of-state.

If we take a sample of 6 students, what’s the prob that exactly 3 of them are from out-of-state?

X ~ binom(6, .8)

P(X=3)

20*.8^3*.2^3
## [1] 0.08192

Binomial facts:

Support: {0,1,2,..,n}

E[X] = n*p

V[X] = n*p*(1-p)
3). Geometric Distribution

X ~ geom(p)

X = the number of trials needed until the first success.

Ex: We toss a coin repeatedly until we get H. What’s the prob it takes 5 times?

P(T and T and T and T and H) = (1/2)^4 * 1/2

Ex). Suppose we choose students at random until we get one from Ohio. (remember: 80% from out-of-state).

What’s the prob that we must sample at least 3 students until we get the first from Ohio?

Support: {1,2,3,….}

Expected Value: E[X] = 1/p. ONLY FOR GEOMETRIC!
Variance: V[X] = (1-p)/p^2
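For the Ohio example: needing at least 3 students means the first two picks are out-of-state (taking P(Ohio) = .2 from the 80% figure above). One caution: r’s geometric functions count FAILURES before the first success, not total trials.

```r
0.8^2                                     # P(X >= 3): first two are out-of-state = 0.64
# "at least 3 trials" = "at least 2 failures" in r's parameterization:
pgeom(1, prob = 0.2, lower.tail = FALSE)  # same: 0.64
```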

4). Poisson Dist

X ~ pois(mu)

X = the number of occurrences of a random event over an interval of space/time

Assumptions:

  • events happen randomly and independently
  • constant chance of occurrence at any given time/place
  • the likelihood of occurrence is proportional to the length of the interval (more time -> more likely!)

Ex). Since 1883, Tokyo has experienced 5 earthquakes of the “most severe” category.

Assuming constant likelihood over time, what’s the probability that at least one occurs in the next year? In the next 10 years?

mu = 5 quakes / 140 years = 5/140 quakes per year

P(X>=1) = P(X=1) + P(X=2) + ….. = 1 - P(X=0)

1-exp(-5/140)*(5/140)^0/1
## [1] 0.03508406

b). next 10 years?

idea: unit changed! want quakes/10year period

mu = 10*5/140 = 50/140

1-exp(-50/140)*(50/140)^0/1
## [1] 0.3003275
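Same answers via ppois:

```r
1 - ppois(0, lambda = 5/140)    # at least one quake in the next year
1 - ppois(0, lambda = 50/140)   # at least one in the next 10 years
```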

Ch 4 - Continuous Distributions

Ex. Height is cts.

In US, avg height for women is 64”, stdev = 2.4”.

What’s the prob that a rando woman is EXACTLY 64.0000000000….”

A: zero! weird!

True for ALL cts dist:

 P(X = a) = 0        for any a!
 

So, only makes sense to talk about RANGES for cts variables.

What to do? probability DENSITY function!

Friday Feb 10

Ch4: Continuous Dists

Idea: need a DENSITY function. f(x). pdf. probability = AREA under the curve!

Uniform Dist

“straight line dist”

Ex) Suppose the time you wait at a bus stop varies uniformly btwn 5 and 15 min.

X~unif(5,15)

a). sketch the pdf

b). What’s P(X<7)

c). What’s P(X<=7)

d). What’s P(X=7) = 0

In general:

X cts:     P(a <= X <= b) = P(a < X < b)

WARNING:  not true for X discrete!

Ex). Suppose f(x) = kx^2 for 0 <= x <= 2 (0 otherwise). What value of k makes f(x) a valid pdf?

Ex). Compute P(X<1) = F(1)
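With k = 3/8 (the value that makes the total area 1, as used below), integrate() confirms both answers numerically:

```r
f <- function(x) (3/8) * x^2
integrate(f, 0, 2)$value   # total area = 1, so it's a valid pdf
integrate(f, 0, 1)$value   # P(X < 1) = F(1) = 1/8
```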

4.2: Cdf, E[X], V[X]

cdf = all probability to the lEFT

F(x) = P(X <= x)

Ex). Find cdf for above dist.

f(x) = (3/8) x^2 0 <= x <= 2. (0 otherwise.)

Ex). Find P( 1/2 < X < 1 ). = F(1) - F(1/2)

                        = 1^3/8 - (1/2)^3/8
                        

Ex). Find cdf for X~unif(a,b)

E[X] and V[X]

Before: X discrete:

E[X] = sum( x*P(x) )

Now: replace sum with integral!

V[X] = sum( (x-mu)^2 * P(x) )

Now: replace with integrals!

Ex. Compute E[X] for above

f(x) = 3/8 x^2 0 <= x <= 2. (0 otherwise.)
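Numeric check of the integrals with integrate():

```r
f <- function(x) (3/8) * x^2
integrate(function(x) x * f(x), 0, 2)$value           # E[X] = 3/2
integrate(function(x) (x - 1.5)^2 * f(x), 0, 2)$value # V[X] = 0.15
```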

4.13e) (p155) What’s the prob that headway (X) is within 1 standard deviation of the mean value?

Ex). Draw a 5-card hand at random. What’s prob of 1 pair?

Ex: 3 3 7 J Q

NO: 3 3 3 J Q

NO: 3 3 7 7 Q

Decisions:

  • which kind for pair? 13C1

  • which 3s? 4C2

  • what kind for remaining? 12 C 3

  • for each card: 4C1. (three of them)

Prob:

( 13C1 * 4C2 * 12C3 * (4C1)^3 ) / 52C5

(number of ways for 1 pair) / total number of 5-card hands

Since 1883, tokyo had 5 serious quakes. What’s the prob of at least 8 quakes in the next year?

X ~ pois(mu=5/140)

P(X >= 8) = P(8) + P(9) + P(10) + …

      = 1 - P(7) - P(6) - ... - P(1) - P(0)
      
      
      
      


Ex). (book problem) Suppose depth has a uniform distribution on (7.5, 20).

a). what are mean and variance of depth?

E[X]:

20^2/25 - 7.5^2/25
## [1] 13.75

E[X^2]:

20^3/(12.5*3) - 7.5^3/(12.5*3)
## [1] 202.0833

V[X] = E[X^2] - E[X]^2:

202.08 - 13.75^2
## [1] 13.0175

(exact: (20-7.5)^2/12 = 13.0208; the small gap comes from rounding 202.0833 down to 202.08)

Week 5 (Exam I Week)

Fri Feb 17

4.3 Normal Distribution

Ad: shows up in nature allll the time!

Idea: Most of us are in the middle! More extreme is less likely.

Note: Tail area is non-negative allll the way down to +/- infty.

Fortunately, gets small quickly.

pdf/cdf for normal dist

problem: f(x) is NOT INTEGRABLE BY HAND!

Need computational aid:

  • table of values
  • software (r)

Problem! WHat are mu and sigma?

In general: from the problem.

Table: STANDARD NORMAL DIST!!!

   mu = 0,  sigma = 1
   

Ex). Suppose Z~normal(0,1)

(ie, std normal!)

a). P(Z < 2.76) = .9971

b). P(Z > 2.76) = 1 - .9971
1-.9971
## [1] 0.0029
c). P(Z<=2.76) = same as a)

d). P( -0.25 < Z < 1.87 )

F(1.87) = .9693 F(-0.25) = .4013

.9693 - .4013
## [1] 0.568

Backwards problems

Suppose Z has std normal dist.

a). How large would Z have to be in order to be at the 90th percentile?

z = 1.28

Note: won’t always be exact. Pick the closest.

b). How large would z have to be in order to be in the 40th percentile?

z = -.25

z-scores

Problem: real world data NOT standard normal!

Solution: z-score!

z-scores CONVERT to std normal dist!

Ex). Height for adult women in US follows normal dist with mean 64”, stdev 2.4”.

a). What’s prob that rando woman is less than 65” tall?

P(X<65) = ?

z = (65 - 64)/2.4

(65 - 64)/2.4
## [1] 0.4166667

look up in table! ->. left area = .6628.

About a 66.3% chance.

b). What’s the prob that rando woman is at least 75” tall? z-score:

(75 - 64)/2.4
## [1] 4.583333

uh-oh! off the charts!

area < .002

area ~ 0

c). How tall must a woman be in order to be in the top 5% of women’s heights?

First: find z-score from table!

halfway btwn: 1.64 and 1.65. average them! z = 1.645

now: solve for x!

1.645*2.4 + 64
## [1] 67.948

She’d have to be at least 67.948” tall.

Week 6

Monday Feb 20

4.3 Normal dist

normal dist in r.

Ex. If Z has std normal dist, Z~norm(0,1)

a). (Forwards). P(Z<1.49) = .9319

in r: “pnorm(zscore)”

pnorm(1.49)
## [1] 0.9318879
b). (Backwards) How large must Z be to be at the 60th percentile?

z = .25

in r: qnorm(percentile)

qnorm(.6)
## [1] 0.2533471
Ex). Heights for men in US follow normal dist with mean 69”, stdev 2.7”.

What’s the prob that rando man is btwn 66” and 70” tall?

find z-score!

x=70:

(70-69)/2.7
## [1] 0.3703704

x=66

(66-69)/2.7
## [1] -1.111111

P(-1.11 < Z < 0.37) =

pnorm(.37)
## [1] 0.6443088
pnorm(-1.11)
## [1] 0.1334995

area:

pnorm(.37)-pnorm(-1.11)
## [1] 0.5108092

About 51.1% of men are btwn 66” and 70”.

b). In what range of heights (low-to-high) are the middle 50% of men’s heights?

z-score:

lo

qnorm(.25)
## [1] -0.6744898

hi

qnorm(.75)
## [1] 0.6744898

Got z, now solve for x!

lo:

-.67*2.7+69
## [1] 67.191

hi:

.67*2.7+69
## [1] 70.809

The middle 50% of men’s heights are btwn 67.2” and 70.8”.

IQR:

70.8-67.2
## [1] 3.6

4.3 Normal (cont) Normal approximation to the Binomial Dist

Idea: for large n, then binomial looks just like normal!

Can be VERY CONVENIENT to replace complicated binomial with simple normal!

Ex) We roll a die 100 times. What’s the prob that no more than 20 rolls show “4”?

Note: here, n is “large”!

  • n*p >= 10. <- at least 10 successes
  • n*(1-p) >= 10 <- at least 10 failures

mu

100*1/6
## [1] 16.66667

sigma

sqrt(100*1/6*5/6)
## [1] 3.72678

Find z-scores!

P(X<=20)

(20-16.667)/3.727
## [1] 0.8942849

area from table:

pnorm(.89)
## [1] 0.8132671

There’s about 81.3% chance of no more than 20 “4”s in 100 rolls.

b). What’s prob that we get EXACTLY 17 “4”s in 100 rolls?

X~norm(16.667, 3.727)

P(X=17) = 0

Fix: “continuity correction”

Fudge: P(16.5 < X < 17.5)

z-scores!

17.5:

(17.5-16.667)/3.727
## [1] 0.2235042

16.5:

(16.5-16.667)/3.727
## [1] -0.04480816

area:

pnorm(0.22)-pnorm(-0.04)
## [1] 0.1030179

There’s about a 10.3% chance of getting 17 “4”s in 100 rolls.

d). Use continuity correction to compute P(X<=20)

z-scores:

(20.5-16.667)/3.727
## [1] 1.028441
(-.5 - 16.667)/3.727
## [1] -4.606118

area:

pnorm(1.03) - pnorm(-4.61)
## [1] 0.848493

Summary of example calculations

roll dice 100 times. count “4”s.

X~binom(100,1/6)

X~norm(16.667, 3.727)

P(X<=20)

exact:

pbinom(20,100,1/6)
## [1] 0.8481122

normal approx, no cont correction:

pnorm(20,16.667,3.727)
## [1] 0.8144153

with cont correction:

pnorm(20.5,16.667,3.727)-pnorm(-.5,16.667,3.727)
## [1] 0.8481268

Wed Feb 22

4.4 Exponential Dist

Recall: X~pois(mu = avg rate of occurrences)

X = how many occurrences in unit of time/space

support: {0,1,2,…}

Now: X~exponential(lambda = avg rate of occurrences)

X = amount of time until the next occurrence

Ex). An urgent care center receives an average of 2.3 patients/hour.

a). What’s the prob that at least 2 patients arrive in the next hour?

X~pois(mu=2.3)

P(X>=2) = 1 - P(X<2) = 1 - P(0) - P(1)

1-exp(-2.3) - exp(-2.3)*2.3
## [1] 0.6691458

b). What’s the prob we wait less than 1.5 hours for the next patient?

X ~ exp(lambda = 2.3)

-exp(-2.3*1.5)+1
## [1] 0.9682544
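Or directly with pexp:

```r
pexp(1.5, rate = 2.3)   # same as 1 - exp(-2.3 * 1.5)
```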

E[X] and V[X] for exp:  E[X] = 1/lambda, V[X] = 1/lambda^2

cdf:  F(x) = P(X <= x) = 1 - e^(-lambda*x), for x >= 0

4.6 probability plots

Idea: what dist does our data have?

Care about: is our data NORMAL?

Look at histo!

Ex). iris. Q: is sepal length normal?

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

hist:

hist(iris$Sepal.Length)

Opinion: normal-ish. Unimodal, roughly symmetric.

Normal quantile plot (qqnorm)

x-y plot:

x = normal percentiles. (qnorm) y = original data

Ex: {5,6,7,8,9}

P10:

qnorm(.1)
## [1] -1.281552

P30:

qnorm(.3)
## [1] -0.5244005

And so on for P50, P70, P90.

Friday Feb 24

4.6 Prob plots

Ex: {5,6,7,8,9}

qqnorm(c(5,6,7,8,9))

Ex). iris sepal length

hist(iris$Sepal.Length)

qqnorm(iris$Sepal.Length)

Q: is it linear?

A: sure, why not.

Ex). iris petal length

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

hist:

hist(iris$Petal.Length)

qqnorm(iris$Petal.Length)

Oof! Super not linear, data does -not- follow a normal dist.

Skip 5.1-5.2

5.3 - Sampling Distributions

So far: we’ve talked about INDIVIDUALS.

Ex). What’s prob that rando woman is at least 65” tall?

Now: want to talk about entire SAMPLE STATISTICS (think: xbar, phat).

Ex. In a sample of 10 women, what’s the prob their AVERAGE HEIGHT is at least 65” tall?

A SAMPLING DISTRIBUTION is a prob dist that describes the likelihood of entire SAMPLE RESULTS (xbar, phat).

Ex) Roll 2 dice, take average = xbar. n = 2. Construct the samp dist for xbar.

CENTRAL LIMIT THEOREM

CLT: As long as n is “large enough”, then the samp dist for xbar follows an approximately normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!!!

mean: “the mean of the mean is the mean” “our expectation for xbar is the same as the population mean, mu”

variance: cut down by factor of n

Ex) What’s prob that rando woman is at least 65” tall?

Recall: mu = 64, sigma = 2.4.

z-score:

(65-64)/2.4
## [1] 0.4166667

right area:

1-pnorm(.42)
## [1] 0.3372427

Ex) Sample n=10 women. WHat’s the prob their AVERAGE height is at least 65”?

xbar ~ normal

mu_xbar = 64
sigma_xbar = 2.4/sqrt(10)

z-score!

(65-64)/(2.4/sqrt(10))
## [1] 1.317616

right area:

1-pnorm(1.32)
## [1] 0.09341751

About 9.3% chance that 10 women have mean height at least 65”.

Is n large enough?

2 cases:

  • If we’re sampling from a normal population (think: height), then xbar is normal no matter what, even for small n.

  • If we’re sampling from a NON-NORMAL (or UNKNOWN) population, then NEED LARGE n in order to use normal dist (CLT).

    "large enough":  n >= 30
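A quick simulation of the claim, using the (decidedly non-normal, flat) die population:

```r
set.seed(1)  # for reproducibility
xbars <- replicate(10000, mean(sample(1:6, 30, replace = TRUE)))
mean(xbars)   # close to mu = 3.5
sd(xbars)     # close to sigma/sqrt(n) = 1.708/sqrt(30), about 0.31
hist(xbars)   # bell-shaped, as the CLT promises
```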

Week 7

Monday Feb 27

5.3-5.5 Sampling Distributions

Last: samp dist for xbar

Now: samp dist for phat (sample proportion)

CLT for phat

If sample size (n) is large enough, then phat follows an (approximately) normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!

Large enough:

  • n*p >=10. (at least 10 successes)
  • n*(1-p) >=10 (at least 10 failures)

z = (obs - exp)/stdev

mean: expectation for sample proportion is the pop proportion!

Ex). Voters in a particular county are 55% R. 45% D.

If we take a sample of 100 people, what’s the prob that half or fewer are R?

P( phat <= .5 )

Is n large enough?

n*p:

100*.55
## [1] 55

n*(1-p)

100*(1-.55)
## [1] 45

normal dist! get z-score!

z = (obs - exp)/stdev

obs = .5

exp = .55

stdev = sqrt(.55*(1-.55)/100)

(.5 - .55)/sqrt(.55*(1-.55)/100)
## [1] -1.005038

find area on table:

pnorm(-1.01)
## [1] 0.1562476

About a 15.6% chance that half or fewer in our sample are R.

Ch 6: Parameter Estimation

Q: what the heck ARE all these parameters we’re talking about?

mu? sigma? p? lambda?

A: guess (estimate!) based on data!

Think:

Best guess for mu is: xbar!

Best guess for p: phat

Best guess for sigma: s (samp stdev)

Best guess for lambda (exponential dist): ???????

A POINT ESTIMATE is a single number representing a reasonable guess about a parameter.

Big question: how to find it? The “best guess” is not always obvious!

3 strategies:

  • unbiased estimators
  • method of moments
  • max likelihood estimators (MLEs)

Unbiased

Ex). Show that xbar is an unbiased estimator for mu.

Ex). Show that phat is unbiased estimator for p.

Practice: Show that s^2 is unbiased estimator for sigma^2 (solution: p253)

Method of Moments

Example 6.12

Wed Mar 1

6.2 Parameter estimation

  • unbiased
  • m.o.m
  • m.l.e

Method of Moments

2 kinds:

  • moment for distribution
  • moment for data

MoM: match up the moments! solve for parameters!

Note: how many moments do we need? answer: as many as the number of parameters.

Ex). Suppose x1, x2, …, xn are sampled from EXPONENTIAL distribution. Use MoM to estimate lambda.

Ex). x1, … xn are sampled from NORMAL dist. Use MoM to estimate mu and sigma.

Max Likelihood Estimator (MLE)

Likelihood function:

probability of observing OUR sample result (x1, x2, ..,xn) GIVEN the value of the params

MLE: find choice of params that MAXIMIZES (calculus!) the likelihood function

Ex). Use MLE to find estimate for lambda in exponential dist.

1). get joint pdf

2). sneaky trick: take log of both sides

3). take deriv, set =0. (ie find max!) [NOTE: take deriv with respect to LAMBDA!!!]
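For the exponential case, the three steps work out to (a standard result, and it matches the MoM answer):

```latex
L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum x_i}
\qquad
\ln L(\lambda) = n \ln \lambda - \lambda \sum x_i
\qquad
\frac{d}{d\lambda} \ln L = \frac{n}{\lambda} - \sum x_i = 0
\;\Rightarrow\;
\hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}
```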

Ex). 6.22 (p273)

Find estimate for theta using a) MoM and b) MLE

Friday Mar 3

6.2 MoM and MLE

Ex) Use MLE to estimate mu and sigma for normal dist.

Last time, for both unbiased AND M.O.M:

mu = xbar sigma = s

WARNING: must take deriv with respect to BOTH mu and sigma (not at the same time).

Idea: two equations, two unknowns!

Conclusion: MLE estimator for mu is the same (xbar), BUT the estimator for sigma is DIFFERENT! The MLE is BIASED!!!

Ch7 - Confidence Intervals

Idea: what’s a RANGE of reasonable values for the parameter (mu, sigma, p, lambda, etc etc), how sure are we the true value is in there?

Ex). Height for adult men is normal with mean 69”, stdev 2.7”.

mu = 69, sigma = 2.7

If we take a sample of n=36 men, what are the cutoffs for the middle 95% of such sample means?

Q: is xbar normal?

A: two cases:

  • if X is normal, then so is xbar (even small n)
  • if n is large (>=30), then xbar is normal for any population

Middle 95% -> tail area = 2.5% -> z = +/- 1.96

qnorm(.025)
## [1] -1.959964

lo:

69-1.96*2.7/sqrt(36)
## [1] 68.118

hi:

69+1.96*2.7/sqrt(36)
## [1] 69.882

Conf Int

Idea: capture the middle 90%/95%/99% of the sampling distribution.

What are reasonable values of mu based on xbar?

Problem: what about sigma?

Temp solution: for large n (n>=30), s is a good approximation for sigma.

Ex) A biologist wonders: what’s the mean height of reticulated giraffes? Collect data on 40 giraffes, their mean height is 18.2’, stdev 1.7’.

Construct a 95% C.I for the true mean height of giraffes.

Here: critical value = z_(alpha/2) = 1.96

plug and chug!

low:

18.2 - 1.96*1.7/sqrt(40)
## [1] 17.67316

hi:

18.2 + 1.96*1.7/sqrt(40)
## [1] 18.72684

We are 95% confident that the true mean height of all reticulated giraffes is btwn 17.67 and 18.73 ft.

Is this the same as: ??

There’s a 95% chance that mu is btwn 17.67 and 18.73.

NO! To be continued…

In-class demo: a 50% CI for the mean of 30 dice rolls:

ourData <- sample(1:6,30, replace=TRUE)

mean(ourData) - .67*sd(ourData)/sqrt(30)
## [1] 3.546203
mean(ourData) + .67*sd(ourData)/sqrt(30)
## [1] 3.920463
qnorm(.25)
## [1] -0.6744898

Week 8

Monday March 6

Ch 7 - Confidence Intervals

Idea: want best guess for pop parameter based on data!

ESTIMATE!!!!

Ex). Prof Miller wonders: what’s the true mean volume of coffee she pours in the morning?

Collects data on 50 days. Mean volume = 8.2, stdev 1.7oz.

Estimate the true mean volume with 99% confidence.

tail area = .005 -> z = 2.58

qnorm(.005)
## [1] -2.575829

plug and chug!

8.2 - 2.58*1.7/sqrt(50)
## [1] 7.579726
8.2 + 2.58*1.7/sqrt(50)
## [1] 8.820274

We are 99% confident that the true mean volume of morning coffee is between 7.58oz and 8.82oz.

What does CONFIDENCE really mean?

Ex). Dice rolls! Goal: use sample data to approximate mean value of dice throw.

Note: we know that mu = 3.5

dice sample: (n=10)

sample(1:6, 10, replace = T)
##  [1] 4 4 4 3 2 5 2 5 3 5

Take sample of size n=50, construct 50% CI:

ourData <- sample(1:6, 50, replace = T)
ourData
##  [1] 2 5 6 6 2 4 6 5 1 4 2 6 2 3 3 3 4 6 5 4 5 4 1 2 4 5 4 6 5 5 2 2 1 3 5 2 5 2
## [39] 6 4 4 1 2 5 6 4 1 6 3 1
# note: qnorm(.25) is negative, so "+" gives the LOWER bound
mean(ourData) + qnorm(.25)*sd(ourData)/sqrt(50)
## [1] 3.538478
mean(ourData) - qnorm(.25)*sd(ourData)/sqrt(50)
## [1] 3.861522

MORAL OF THE STORY: the confidence level is about the METHOD used for constructing the CI; it is NOT about the individual bounds you calculated.

You have NO IDEA if your interval was successful or not.
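One way to see this: simulate many samples and count how often the interval catches mu. A sketch using the dice population (where we know mu = 3.5); the seed is arbitrary, chosen only for reproducibility:

```r
set.seed(1)   # arbitrary seed, for reproducibility
hits <- replicate(10000, {
  x  <- sample(1:6, 50, replace = TRUE)
  lo <- mean(x) - 1.96 * sd(x) / sqrt(50)
  hi <- mean(x) + 1.96 * sd(x) / sqrt(50)
  lo <= 3.5 && 3.5 <= hi          # did this interval succeed?
})
mean(hits)   # proportion of successful intervals: close to .95
```

About 95% of the intervals succeed, but any single interval either contains 3.5 or it doesn't.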

Margin of error

Math: everything after the +/- in the formula.

Goal: make MOE small! Two factors that affect MOE:

  • sample size (n). As sample size increases, the MOE for CI decreases

    Think: yaaaay!

  • confidence level. As confidence increases, the MOE gets BIGGER!!!

    Think: bummer!

Planning ahead for sample size

Recall, the MOE in the coffee example was:

2.58*1.7/sqrt(50)
## [1] 0.6202741

Suppose we wish to conduct a follow-up study that estimates the true volume within 0.1 oz, at 99.9% confidence.

How large must our sample be?

Idea: plug in to MOE formula, solve back for n.

Info:

MOE < .1
z = 3.29

qnorm(.0005)
## [1] -3.290527

prior estimate for sigma = 1.7oz

plug in!

(3.29*1.7/.1)^2
## [1] 3128.165

Need n at least 3129.

Note: might need to compromise! Maybe either less confidence, OR accept a bigger margin of error.
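The plug-in-and-solve-for-n step can be wrapped in a small helper (a sketch; the function name is ours, not standard R):

```r
# required n to estimate mu to within moe, at the given confidence,
# using a prior estimate of sigma; always round UP
sampleSizeMean <- function(moe, sigma, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling((z * sigma / moe)^2)
}

# coffee follow-up: within 0.1 oz at 99.9% confidence, sigma ~ 1.7
sampleSizeMean(0.1, 1.7, conf = 0.999)
```

Using the exact z (3.2905…) gives 3130, a touch more than the 3129 you get from the rounded table value 3.29.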

Wed March 8

CH 7: CIs

So far: CI for mu (large n)

Q: wait. if we don’t know mu, how the heck do we know sigma?

A: swap out for s (ok if n large, since s is unbiased).

Q: what if n is small?

before:

z = (xbar - mu)/(sigma/sqrt(n))

now:

t = (xbar - mu)/(s/sqrt(n))

the t dist

The t dist comes from using s instead of sigma.

Derivation: grad school!

Neat t facts:

  • t dist looks awful lot like z dist
  • weird thing: shape of the t dist depends on the sample size
    • n larger -> t gets closer to normal. (basically no difference after n=100)
  • t dist has thicker tails than normal dist (always)

Q: are t-intervals wider or narrower than z-intervals?

A: t always wider! more stuff in the tails!

Think: bummer! bigger MOE with t dist!

It’s the price we pay for not knowing sigma.
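You can watch the t critical values shrink toward the z value as n (and hence df = n-1) grows:

```r
# 97.5th percentile cutoffs: t for several df, then the z limit
sapply(c(5, 15, 30, 100), function(df) qt(.975, df))
qnorm(.975)   # about 1.96
```

Every t cutoff is above 1.96, and they decrease steadily as df increases.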

t table

WARNING: t table is BACKWARDS from z table!

In the t-table:

  • values IN the table are t-scores
  • area on top margin

Note: need DEGREES OF FREEDOM

df = n-1

Ex). Suppose we wish to estimate mu with 95% confidence from sample with n=15.

Note: it’s ESSENTIAL to use t here, not z!

t = 2.145

qt(.025,14)
## [1] -2.144787

Ex) Estimate mu with 99% confidence, n=20.

What’s t?

t = 2.861

Ex) A researcher wonders, what’s the mean height for women in Sweden?

Collect data from 12 women in Sweden, find mean height 65.2”, stdev 2.5”.

Estimate the true mean height with 90% confidence.

t = 1.796

65.2 - 1.796*2.5/sqrt(12)
## [1] 63.90385
65.2 + 1.796*2.5/sqrt(12)
## [1] 66.49615

We are 90% confident that true mean height for Swedish women is btwn 63.9” and 66.5”.

Conf Int for p

Ex). What proportion of DU students will travel out-of-state for break?

random sample of 100 students: 53 of them say “yes”, travel.

Estimate the true proportion of travelling students with 94% confidence.

qnorm(.03)
## [1] -1.880794

z = 1.88

plug and chug!

.53 - 1.88*sqrt(.53*(1-.53)/100)
## [1] 0.4361694
.53 + 1.88*sqrt(.53*(1-.53)/100)
## [1] 0.6238306

We’re 94% confident that the true prop of travelling students is btwn 44% and 62%.

Q: what if we wanted smaller MOE?

MOE:

1.88*sqrt(.53*(1-.53)/100)
## [1] 0.09383065

Plan ahead for sample size

Ex). Suppose we wish to conduct a future study that estimates the prop to within 1%, at 95% confidence.

Idea: plug in to MOE formula, solve back for n

prior estimate for p = .53

z = 1.96

MOE = .01

.53*(1-.53)*(1.96/.01)^2
## [1] 9569.426

Need n at least 9570.

Q: what if no prior estimate for p?

A: use p = .5. Worst-case scenario.

Week 9

Monday March 20

Recall CLT: As long as n is large, then our sample statistics (like xbar and phat) follow a normal dist, NO MATTER WHAT POPULATION WE’RE SAMPLING FROM!!!

Ch 8: Hypothesis Tests

Ex). A researcher suspects that women in Sweden have a higher mean height than women in the US.

Data: a random sample of 50 Swedish women have mean height of 65.1”, stdev 2.8”.

Does this provide evidence that mean height for Swedish women is above 64” (the mean for the US)?

1). Hypotheses

H0: mu = 64 (no difference)
Ha: mu > 64 (claim/suspicion being investigated)

2). Test statistic (summary of our data)

z-score! (er, actually, the t score here, since we don’t know sigma)

t = (obs - exp)/stdev = (65.1 - 64)/(2.8/sqrt(50))

(65.1 - 64)/(2.8/sqrt(50))
## [1] 2.777919
3). p-value

p-val = probability of observing a sample result as (or more) extreme than ours, ASSUMING THE H0 IS TRUE.

Use “pt” command:

right tail:

1-pt(2.78, 49)
## [1] 0.003844915

On exam: use t table.

Interpret: If it’s true that mean height for Swedish women is 64”, then the probability of observing a sample mean as big as ours (65.1”) is only 0.38%.

Note: looks like H0 and sample data DISAGREE!!

4). Conclusion

Since our p-val is small (p-val < .05), we REJECT H0. We found strong evidence that mean height for Swedish women is greater than 64”.
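If we had the 50 raw heights (here we only have summary stats), R's t.test would do steps 2–4 in one call. A sketch with simulated stand-in data (rnorm values; the seed is arbitrary and the data are made up):

```r
set.seed(2)                       # arbitrary seed; data are hypothetical
heights <- rnorm(50, mean = 65.1, sd = 2.8)

# one-sided one-sample t test: H0: mu = 64 vs Ha: mu > 64
t.test(heights, mu = 64, alternative = "greater")
```

The output reports the t statistic, df = 49, and the one-sided p-value all at once.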

Two big methods for statistical inference:

  • conf int: estimate a parameter
  • hyp test: answer a question/investigate a claim

Comparison to Criminal Trial

1). Hypotheses

We assume that the defendant is innocent (H0), BUT we think they’re guilty (Ha)!

H0: default assumption (no difference/ NOT GUILTY!!!) note: H0 always has “=”. (mu =…., p=….., sigma =…..)

Ha: our claim/suspicion/question (GUILTY!!) note: always inequality: >, <, !=

2).

In criminal trial: witness give testimony. In hyp test: data!

jury deliberates: is evidence convincing? hyp test: is our p-val small enough?

Note:

  • small p-vals are evidence AGAINST H0
  • big p-vals aren’t evidence of anything

Are we convinced “beyond a shadow of a doubt”???

Hyp test: is p-val < significance level ??? (alpha=.05) If p<alpha -> reject H0. If p>=alpha -> fail to reject H0.

WARNING: each conclusion must have TWO SENTENCES:

  • reject/fail to reject H0
  • is there evidence to support Ha?

Wed March 22

Ch 8 - Hyp Tests

Ex). A researcher suspects that instrumentalists (musicians) have a higher mean IQ than the general population (100).

Data: in a sample of 100 musicians, mean IQ of 103, stdev 15.2.

Does this provide strong evidence to support their suspicion?

Perform a complete hyp test.

1). Hypotheses

H0: mu = 100
Ha: mu > 100

2). Test stat: (t/z score!)

(103 - 100)/(15.2/sqrt(100))
## [1] 1.973684

3). p-val

note: df = 100-1 = 99

Not on table! Use next smallest: df = 60.

tail area: btwn .025 and .05

.025 < p-val < .05

p-val = prob of getting a sample like ours, if H0 is true

If it’s really true that musicians have mean iq of 100, then there’s only a 2.5%-5% chance of observing a sample result as extreme as ours.

4). Since p-val < .05, we REJECT H0! We have strong evidence that mean iq for musicians is greater than 100.

Hyp tests for p (pop proportion)

Ex) A student suspects that fewer than half of DU students vote R.

To investigate, random sample of 60 DU students. Of these, 27 vote R.

Does the data support the claim? Hyp test!

1). Hypotheses

H0: p = .5
Ha: p < .5

2). Test stat

Note: always Z for proportions

z = (obs - exp)/stderr

(27/60 - .5)/sqrt(.5*(1-.5)/60)
## [1] -0.7745967
3). p-val
pnorm(-.77)
## [1] 0.2206499

p-val = .221

If it’s true that 50% of DU students vote R, then there’s a 22.1% chance of getting a sample result like ours.

4). Since p-val > .05, we fail to reject H0. We did not find strong evidence that fewer than half of DU students vote R.
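R's built-in prop.test runs an equivalent test; with correct = FALSE (no continuity correction) its p-value matches the z calculation above up to rounding:

```r
# 27 "successes" out of 60; H0: p = .5, Ha: p < .5
prop.test(27, 60, p = 0.5, alternative = "less", correct = FALSE)
```

Same conclusion: the p-value (about .22) is well above .05, so we fail to reject H0.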

Errors in Hyp Tests

Conclusion from test: not always true!

Ex). An EPA enforcement officer suspects that a local well water supply contains too much of Toxin A (the federally mandated level is 0.8 g/ml).

To investigate, she collects water samples (n=50), and computes average concentration of Toxin A.

Hyp test!

Explain, practically, what it would mean for her test to result in a Type I error? Type II error?

Two issues for each:

  • what do we THINK about the water?
  • what’s ACTUALLY TRUE about the water?

Type I: We conclude/think that the water is unsafe (too much toxin). BUT, actually the water conforms to safety standards (it’s fine).

Type II: We think the water’s fine, but actually it’s dangerous!

Arguments:

Type II is worse, since people drink dirty water!

Type I is worse, since the employee could be fired! Type I is worse, since it wastes precious taxpayer dollars!

Ex). Criminal trial!

Practically, what do type I and type II really mean?

H0: innocent Ha: guilty

  • Type I: We think they’re guilty, but really they’re innocent!
  • Type II: We think they’re innocent, but really they’re guilty!

Arguments:

Type I is worse since innocent people shouldn’t go to jail!

Type II is worse, since a dangerous person could be set free!

Moral of the story: which is worse? IT DEPENDS!

Upshot: you (the investigator) can tip the scales!

Final Project

Big idea: investigate a question/thesis, using a dataset!

My recommendation: pick an interesting (to you!) public dataset.

The format you NEED!

“.csv”

Rectangular data!

Example: customer data:

Click here for Kaggle page

Load data: Note: you need to tell R where the csv file is. I put mine in my downloads folder. If you are confused about this, get help in office hours!

myData <- read.csv("~/Downloads/Customers.csv")
head(myData)
##   CustomerID Gender Age Annual.Income.... Spending.Score..1.100.    Profession
## 1          1   Male  19             15000                     39    Healthcare
## 2          2   Male  21             35000                     81      Engineer
## 3          3 Female  20             86000                      6      Engineer
## 4          4 Female  23             59000                     77        Lawyer
## 5          5 Female  31             38000                     40 Entertainment
## 6          6 Female  22             58000                     76        Artist
##   Work.Experience Family.Size
## 1               1           4
## 2               3           3
## 3               1           1
## 4               0           2
## 5               2           6
## 6               0           2
hist(myData$Spending.Score..1.100.)

Week 10 (Exam II Week)

Friday March 31

Two tail tests

Two tail tests:

Ha: !=

Ex) A researcher wonders if mean IQ for musicians DIFFERS from the general population (100).

Data: 50 musicians, their mean iq is 102.7, stdev 11.2.

Does this support claim/suspicion?

1). Hypotheses

H0: mu = 100
Ha: mu != 100

2). test stat

t = (obs - exp)/stdev

(102.7-100)/(11.2/sqrt(50))
## [1] 1.704632
3). p-val

right tail area:

1-pt(1.70, 49)
## [1] 0.04773554

p-val = 2*.048 = .096

4). conclusion

Since p-val > .05, we fail to reject H0. There’s insufficient evidence that mean iq for musicians differs from 100.

Q: when to use one-tail vs two tail?

A:

  • read the language in problem (bigger/less/differ)
  • in general, best to default to TWO TAIL test!

Tests for TWO populations.

Ex) A researcher suspects that a higher proportion of DU students are from out-of-state than for Oberlin.

Data:

  • Of 60 DU students, 51 were from out-of-state
  • of 80 Ob students, 58 were from out-of-state

1). Hypotheses

H0: p1 - p2 = 0
Ha: p1 - p2 > 0

(p1 = DU prop, p2 = Ob prop)

2). test stat

z = (obs - exp)/stderr

(51/60 - 58/80 - 0)/sqrt( 109/140*(1-109/140)/60 + 109/140*(1-109/140)/80)
## [1] 1.76279
3). p-val. Right tail area:
1-pnorm(1.76)
## [1] 0.0392039

Interpret: If there’s no difference in proportion btwn DU and Ob, then there’s a 3.9% chance of observing a sample difference as extreme as ours.

4). Conclusion

Since p-val < .05, we reject H0. There’s strong evidence that the true proportion of out-of-state students at DU is higher than at Oberlin.

2-pop tests for mu1 - mu2

Ex) Prof Miller wonders, is there a difference in mean delivery time btwn Pizza Hut and Dominos?

Collect data about deliveries:

Of 23 PH deliveries, mean time was 23.4 min, stdev 6.3 min. Of 28 Dom deliveries, mean time was 27.9 min, stdev 9.4 min.

1). Hypotheses

H0: mu1 - mu2 = 0 (no difference)
Ha: mu1 - mu2 != 0 (yes difference)

2). test stat

t = (obs-exp)/stderr

(23.4 - 27.9 - 0)/sqrt( 6.3^2/23  +  9.4^2/28 )
## [1] -2.036769
3). p-val

Q: what’s df?

A: use the smaller of the two here: df = 23 - 1 = 22

left tail area:

pt(-2.04, 22)
## [1] 0.02676758

p-val = 2* .026 = .052. (two tail test!)

4). Conclusion

Since p-val > .05, we fail to reject H0. There’s not strong evidence of a difference in mean delivery time.
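R's t.test needs the raw delivery times, but the same computation works from summary stats. A sketch (the function name is ours); note it uses the Welch-Satterthwaite df formula, which gives a less conservative df than the "smaller n minus 1" rule used in class:

```r
# Welch two-sample t test from summary statistics
welchT <- function(m1, s1, n1, m2, s2, n2) {
  se <- sqrt(s1^2/n1 + s2^2/n2)
  t  <- (m1 - m2) / se
  # Welch-Satterthwaite degrees of freedom
  df <- se^4 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
  c(t = t, df = df, p.two.sided = 2 * pt(-abs(t), df))
}

# Pizza Hut vs Dominos example
welchT(23.4, 6.3, 23, 27.9, 9.4, 28)
```

With the Welch df (about 47) the two-sided p-value lands just under .05, so the conservative df = 22 rule can flip a borderline conclusion.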

Week 11

Monday April 3

Confidence Intervals for two populations

Note about p1-p2: don’t use pooled estimate for CI. WE’RE NOT ASSUMING p1 = p2 !!!!

H0: p1 - p2 = 0. <- default assumption in Hyp test

In confidence interval, no assumptions. Trying to estimate the difference!

Ex) A researcher wonders: is there a higher proportion of registered Republicans in Licking county than in Franklin county?

Collect data:

  • Of 50 voters in Licking county, 32 were R
  • Of 100 voters in Franklin, 51 were R

Estimate the difference in proportion of R voters btwn the two counties, with 95% confidence.

Crit val: z = 1.96

qnorm(.025)
## [1] -1.959964

low:

32/50 - 51/100 - 1.96*sqrt( 32/50*(1-32/50)/50 + 51/100*(1-51/100)/100 )
## [1] -0.03523393

hi:

32/50 - 51/100 + 1.96*sqrt( 32/50*(1-32/50)/50 + 51/100*(1-51/100)/100 )
## [1] 0.2952339

Interpret: we are 95% confident that the difference in prop of R voters is between -3.5% and 29.5%.

Since 0 is contained in the interval, there’s no strong evidence of a difference in proportion.

Connection btwn Hyp Test and CI

If the value in H0 is contained in the CI, then FAIL TO REJECT H0!

Otherwise, reject.

CI for two means

Ex) A researcher wonders: is there a difference in mean ACT score btwn Licking county and Franklin county?

Data:

  • In 50 random Licking students, mean act score was 21.3, stdev 5.4.
  • In 80 random Franklin students, mean act score was 24.4, stdev 6.1.

Estimate the difference in mean ACT score with 95% confidence.

critical value: t = 2.01

qt(.025, 49)
## [1] -2.009575

low:

21.3-24.4 - 2.01*sqrt(5.4^2/50 + 6.1^2/80)
## [1] -5.157994

hi:

21.3-24.4 + 2.01*sqrt(5.4^2/50 + 6.1^2/80)
## [1] -1.042006

We’re 95% confident that the true difference in mean act score is btwn -5.2 and -1.0.

Looks like mu1-mu2=0 isn’t likely. We reject H0: there’s strong evidence of a difference in mean ACT score btwn Licking and Franklin.

Note: Hyp test and CI will always have same conclusion AS LONG AS ALPHA AND CONFIDENCE MATCH!!!

Ex: if alpha = .05, then use 95% confidence.

Ex: if alpha = .01, then use 99% confidence.

Ch 12 - Linear Regression

Idea: compare TWO quant variables. (x,y)

main visualization: “scatterplot”. x-y plot of all data points.

Ex). mpg

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Make a scatterplot to visualize cty (x) and hwy (y)

plot(mpg$cty, mpg$hwy)

cor(mpg$cty, mpg$hwy)
## [1] 0.9559159

Woah! Strong relationship! Makes sense - if more efficient in city, you’d expect more efficient on hwy.

Ex) Compare engine size (displ - x) with cty mpg (cty - y )

plot(mpg$displ, mpg$cty)

cor(mpg$displ, mpg$cty)
## [1] -0.798524

Woah! Strong negative relationship! As engine size increases, mpg decreases. Makes sense.

Ex). iris data

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Compare iris sepal length (x) and sepal width (y).

plot(iris$Sepal.Length, iris$Sepal.Width)

cor(iris$Sepal.Length, iris$Sepal.Width)
## [1] -0.1175698

Huh. No obvious relationship btwn sepal length and width. Weird but true.

The linear correlation coefficient - r

Note: capital “R” is the software; lowercase “r” is the correlation coefficient.

Wed April 5

Ch 12 - Linear Regression

Ex). mpg.

plot(mpg$cty, mpg$hwy)

cor(mpg$hwy, mpg$cty)
## [1] 0.9559159

Linear Correlation Coefficient - r

r measures the STRENGTH and DIRECTION of a LINEAR relationship.

Properties of corr coeff

  •  -1 <= r <= +1    for ANY variables x,y
  • If r = 1, then PERFECT (colinear) POSITIVE LINEAR relationship
  • If r = -1, then perfect NEGATIVE LINEAR relationship
  • If r~0, then no LINEAR relationship. WARNING: always look at the scatterplot with human eyeballs!
  • r isn’t affected by units (lbs/kgs, in/cm, mi/km)
  • doesn’t matter which is x, which is y

Least-squares line

Many names:

  • best fit line
  • trend line
  • regression line
  • etc etc

Good news: all the same! One UNIQUE “best” line!

Q: What makes it the “best”?

A: it minimizes the sum of the SQUARE RESIDUALS!

Note: the sum/average of residuals is ALWAYS ZERO!!!

Computing equation for least squares

In linear model: we make predictions ABOUT y-variable, BASED UPON the x-variable.

x: input, independent variable
y: output, dependent variable

Be careful, it matters which is which!

Equation:

yhat = b0 + b1*x

where:

b0 = y-intercept
b1 = slope

To compute:

  1. Find slope

b1 = r*sy/sx

Note: if correlation is STRONGER, then slope is STEEPER! Ie, if strong relationship, then x has bigger effect on y!

2). find intercept

Neat fact: the line ALWAYS touches the point (xbar,ybar)

->

b0 = ybar - b1*xbar
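The two formulas can be checked against R's lm on the built-in cars dataset:

```r
# slope from b1 = r*sy/sx, intercept from b0 = ybar - b1*xbar
b1 <- cor(cars$speed, cars$dist) * sd(cars$dist) / sd(cars$speed)
b0 <- mean(cars$dist) - b1 * mean(cars$speed)
c(b0, b1)

# should agree with lm's coefficients
coef(lm(dist ~ speed, data = cars))
```

The hand formulas and lm return identical coefficients.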

Ex). Data: height and weight of 50 men.

Summary:

Height:

avg height = 70.1”
stdev = 3.6”

Weight:

avg weight = 165.2 lbs
stdev = 11.9 lbs

cor = r = 0.71

Find equation for linear model to predict weight based on height.

Here: x=height, y = weight

1). slope:
.71*11.9/3.6
## [1] 2.346944
2). intercept:
165.2 - 2.35*70.1
## [1] 0.465

yhat = 0.465 + 2.35*x

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

This is a good dataset: mixture/variety of BOTH categorical AND quantitative variables!

Friday April 7

Ch 12 - Lin Reg

Ex). mpg. Find equation for linear model for predicting cty (y) mpg based on engine size (displ -x ).

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Summary stats

xbar = 3.47
sx = 1.29
ybar = 16.86
sy = 4.26
r = -0.80

1). slope = b1 = r*sy/sx = -2.64
-.8*4.26/1.29
## [1] -2.64186
2). y-int = b0 = ybar - b1*xbar
16.86 + 2.64*3.47
## [1] 26.0208

yhat = 26.02 - 2.64*x

plot(mpg$displ, mpg$cty)

Making predictions with the least-squares line

Plug in x!

Ex). A particular car has a 2.8L engine. Predict its cty mpg.

plug in x=2.8:

26.02 - 2.64*2.8
## [1] 18.628
plot(mpg$displ, mpg$cty)

Ex). Predict the cty mpg for a car with a 20.5L engine.

26.02 - 2.64*20.5
## [1] -28.1

Can’t! Shouldn’t! EXTRAPOLATION: making predictions outside the observed range of data. BAD!

Junk in - junk out!

Interpret the coefficients of linear model

Coeff: b0, b1

Slope: y=mx+b. If x increases by 1 unit, y inc/dec by m.

Lin reg: If [x-variable] increases by one [x-unit], then we predict/expect [y-variable] to inc/dec by [slope] [y-units].

For each additional L bigger a car’s engine is, we predict its city fuel efficiency to decrease by 2.64 mpg.

Intercept: if x=0, y = y-int

Lin reg: If [x-variable] is zero [x-units], then we predict/expect [y-variable] to be [y-int] [y units].

Car: If a car’s engine is 0L big (!!), we’d predict its city mpg to be 26.02 mpg.

EXTRAPOLATION! Not helpful.

Coefficient of determination - r^2

r^2 = percentage of variation in the y-data that’s due to/because of the linear relationship btwn x and y.

Ex). compute and interpret r^2 for car model.

Of all the variation in city fuel efficiency, about 64% is due to the linear relationship between fuel efficiency and engine size.

compute:

(-.8)^2
## [1] 0.64

Residual Plots

residual = y - yhat obs - exp

Ex). car model. A particular car has a 3.5L engine and gets 17 mpg in the city. Compute the residual for this car.

x = 3.5
y = 17

yhat:

26.02 - 2.64*3.5
## [1] 16.78

resid:

17 - 16.78
## [1] 0.22

Our model under-predicted the mpg by 0.22 mpg.

Residual plot:

x-coords: original x-data
y-coords: residuals!

Ex). construct resid plot for mpg model.

x-coords = mpg$displ

first: predicted yhat:

yhat <- 26.02 - 2.64*mpg$displ

next: resids

resids <- mpg$cty - yhat

resid plot:

plot(mpg$displ, resids)

Week 12

Monday April 10

Residual Plots

In R - “lm” command.

Ex). Find linear model for predicting cty mpg based on displ (engine size).

idea: lm( y ~ x )

= “as a function of”

mpgModel <- lm( cty ~ displ, data = mpg)

Trick: look at variables with $

Summary:

summary(mpgModel)
## 
## Call:
## lm(formula = cty ~ displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3109 -1.4695 -0.2566  1.1087 14.0064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.9915     0.4821   53.91   <2e-16 ***
## displ        -2.6305     0.1302  -20.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared:  0.6376, Adjusted R-squared:  0.6361 
## F-statistic: 408.2 on 1 and 232 DF,  p-value: < 2.2e-16

Residual plot:

x: original x data
y: residuals (from model)

plot(mpg$displ, mpgModel$residuals)

Interpreting residual plots

Here’s what you DON’T want to see:

1). A pattern (ex: curve-shaped). Idea: A pattern in the residuals means that the errors are PREDICTABLE. Suggests linear model isn’t the best.

2). Heteroskedasticity. Idea: we hope that residuals have CONSTANT MAGNITUDE across the range of observations (homoskedastic). If heteroskedastic, suggests the linear model is failing in some parts of the range. Bad!

Example:

Ex). Data = house sales. Make model to predict price based on sqft_living.

houseData <- read.csv("./House-Sales.csv")
houseModel <- lm( price ~ sqft_living, data = houseData)

residuals:

plot(houseData$sqft_living, houseModel$residuals)

Google image: “heteroscedastic residuals”

In general: you HOPE your residuals are a “blob”. Best case scenario.

Ex) mpg. x=cty, y=hwy.

residuals:

mpgModel2 <- lm(hwy ~ cty, data = mpg)
plot(mpg$cty, mpgModel2$residuals)

Ah! No pattern, roughly homoskedastic. Good residuals!

Distribution of residuals

Idea: we hope that residuals follow a normal dist.

Ie: most close to zero, unlikely to be large.

Ex). mpg. x=cty, y=hwy.

hist(mpgModel2$residuals)

Good resids! Roughly symmetric and unimodal!

Ex). mpg. x=displ, y=cty.

hist(mpgModel$residuals)

Hm. Strong right skew suggests linear model not the best.

Multiple Linear Regression

Idea: more variables!

In R: use “+” in the lm command.

Ex). house sales. first: predict price based on bedrooms

houseModel1 <- lm(price~bedrooms, data=houseData)
summary(houseModel1)
## 
## Call:
## lm(formula = price ~ bedrooms, data = houseData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3508706  -203228   -66728   104982  6839614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   129649       8938   14.51   <2e-16 ***
## bedrooms      121790       2556   47.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 349500 on 21611 degrees of freedom
## Multiple R-squared:  0.09507,    Adjusted R-squared:  0.09503 
## F-statistic:  2270 on 1 and 21611 DF,  p-value: < 2.2e-16

Interpret slope: For each additional bedroom, we predict the house’s price to increase by $121,790.

y-int: If a house has zero bedrooms (hmm….), we predict its price to be 129,649. Extrapolation.

Note: r^2 = .095. Low! Seems like bedrooms alone don’t account for much of the price.

Ex 2). Predict price based on bedrooms and bathrooms.

houseModel2 <- lm(price ~ bedrooms + bathrooms, data = houseData)
summary(houseModel2)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms, data = houseData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1484922  -184520   -42472   111528  5919466 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -30900       8276  -3.734 0.000189 ***
## bedrooms       20146       2666   7.557 4.27e-14 ***
## bathrooms     237934       3219  73.912  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312200 on 21610 degrees of freedom
## Multiple R-squared:  0.2777, Adjusted R-squared:  0.2776 
## F-statistic:  4154 on 2 and 21610 DF,  p-value: < 2.2e-16

Ok, better! Now explaining 28% of sale price.

Ex3). Predict price based on bedrooms AND bathrooms AND sqft_living

houseModel3 <- lm(price~bedrooms+bathrooms+sqft_living, data=houseData)
summary(houseModel3)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living, data = houseData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1644794  -144361   -22891   102420  4178611 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  74662.099   6917.977  10.792   <2e-16 ***
## bedrooms    -57906.631   2336.062 -24.788   <2e-16 ***
## bathrooms     7928.708   3512.744   2.257    0.024 *  
## sqft_living    309.605      3.089 100.238   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 258000 on 21609 degrees of freedom
## Multiple R-squared:  0.5069, Adjusted R-squared:  0.5069 
## F-statistic:  7406 on 3 and 21609 DF,  p-value: < 2.2e-16

Best yet! This model explains over 50% of the variation in the y variable.

Idea for multiple variables: hunt for the variables that give highest R^2 value!

Wed April 12

Multiple regression

Idea: why not use many x variables!

Q: Which x variables to use?

1 - Choose variables that give highest R^2. More predictive power!

2 - Occam’s Razor: the simplest explanation is the best! For multiple regression: hope to use as few variable as possible.

Which variables to boot out of our model? Don’t use variables with high correlation! If strongly correlated, then using two doesn’t give more predictive power than just one!

Useful tool in R: ggcorr

Ex). Use ggcorr to summarize correlation coefficients in iris dataset.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Library: GGally

library(GGally)
ggcorr(iris)

Hm. Maybe don’t use BOTH Petal Width AND Petal Length in the same model. Highly correlated! Not adding to the model.

Residual Plots for Multiple Regression

Resid plot:

x: original x data
y: residuals

Hey wait! If multiple x-variables, which to use?

Idea: for multiple regression residuals:

x: “fitted values” (yhat)
y: residuals (y - yhat)

Ex). House sales. Construct linear model to predict price based on bedrooms, bathrooms, sqft_living. Make resid plot.

residExampleModel <- lm( price ~ bedrooms + bathrooms + sqft_living, data = houseData)

Resid plot:

x: fitted values
y: resids

plot(residExampleModel$fitted.values, residExampleModel$residuals)

ggcorr(houseData)

Neat R trick: autoplot

library: ggfortify

library(ggfortify)
autoplot(residExampleModel)

interpret:

  • upper left: strongly heteroskedastic. bad!
  • upper right: deviating from normal dist. bad!

Making Predictions based on CATEGORICAL data.

Linear regression: needs TWO quantitative variables.

plot(iris$Sepal.Length, iris$Sepal.Width)

Ex). iris. Use species to predict sepal length.

sepal length ~ species

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

lm:

speciesModel <- lm(Sepal.Length ~ Species, data = iris)
summary(speciesModel)
## 
## Call:
## lm(formula = Sepal.Length ~ Species, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6880 -0.3285 -0.0060  0.3120  1.3120 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
## Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
## Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5148 on 147 degrees of freedom
## Multiple R-squared:  0.6187, Adjusted R-squared:  0.6135 
## F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

Wait! What happened to setosa?

A: one category (the “reference level”, here setosa) gets “absorbed” into the y-int. The other coefficients are differences from setosa’s mean.

Summary: when we predict on a category, our prediction is just the MEAN VALUE for that category!
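Sanity check: the group means really do match the coefficients in the lm output above.

```r
# mean sepal length per species
tapply(iris$Sepal.Length, iris$Species, mean)
# setosa = 5.006 (the intercept); versicolor = 5.006 + 0.930 = 5.936; virginica = 5.006 + 1.582 = 6.588
```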

Useful: combine it in a multiple regression model!

Ex). Iris: predict sepal length based on sepal width, petal length, petal width, species!

multiCategoryModel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)

summary(multiCategoryModel)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82816 -0.21989  0.01875  0.19709  0.84570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
## Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
## Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
## Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3145 on 146 degrees of freedom
## Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
## F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16
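Careful: the formula above never actually includes Species, even though the example asked for it. A sketch of the intended call (output not shown here):

```r
# multiple regression with the categorical Species variable actually included
fullIrisModel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)
summary(fullIrisModel)
```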

Week 13

Monday April 17

Inference for Linear Regression

Inference: hyp tests and conf ints

Q: is r “big enough”?

A: hyp test!

Hyp Test for rho.

Hypotheses:

H0: rho = 0 (no lin rel) Ha: rho != 0 (yes, some relationship)

Warning: If reject H0, we only conclude there’s SOME linear relationship. Doesn’t necessarily mean STRONG relationship!

Ex) mpg. Predict cty mpg based on displ.

plot(mpg$displ, mpg$cty)

1) Hypotheses

H0: rho = 0 Ha: rho != 0

2) Test stat

Cor coeff:

cor(mpg$displ, mpg$cty)
## [1] -0.798524

Sample size:

length(mpg$displ)
## [1] 234

plug into the test-stat formula, t = r*sqrt(n-2)/sqrt(1-r^2):

-.8*sqrt(234-2)/sqrt(1-(-.8)^2)
## [1] -20.30873
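Same calculation without the rounding, using the exact sample correlation (mpg comes from ggplot2):

```r
library(ggplot2)  # for the mpg data
r <- cor(mpg$displ, mpg$cty)
n <- length(mpg$displ)
r * sqrt(n - 2) / sqrt(1 - r^2)
# about -20.2, matching the slope's t value in the lm summary
```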

3). p-val

recall: cdf for t dist:

pt(-20.3, 232)
## [1] 1.181523e-53

Interpret: if it’s true that there’s no linear relationship btwn engine size (displ) and city mpg, then there’s an almost zero chance of observing a sample correlation as strong as ours. (Strictly, Ha is two-sided, so p-val = 2*pt(-20.3, 232) — still essentially zero.)

4). conclusion

Since p-val <<< .05, we reject H0. There’s strong evidence to suggest a linear relationship btwn engine size and fuel efficiency.

Caveat: we are NOT concluding anything about STRENGTH of this relationship, nor that linear is the best model.

Inference for the SLOPE of regression line.

lm(mpg$cty~mpg$displ)
## 
## Call:
## lm(formula = mpg$cty ~ mpg$displ)
## 
## Coefficients:
## (Intercept)    mpg$displ  
##       25.99        -2.63

In r, to find std err for slope, just look at lm:

summary(lm(mpg$cty~mpg$displ))
## 
## Call:
## lm(formula = mpg$cty ~ mpg$displ)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3109 -1.4695 -0.2566  1.1087 14.0064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.9915     0.4821   53.91   <2e-16 ***
## mpg$displ    -2.6305     0.1302  -20.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared:  0.6376, Adjusted R-squared:  0.6361 
## F-statistic: 408.2 on 1 and 232 DF,  p-value: < 2.2e-16

Ex). Estimate the population slope for the model (x=displ, y=cty) with 95% confidence.

b1 = -2.631 sb = 0.130 n = 234

critical value: t = 1.970

qt(.025, 232)
## [1] -1.970242

plug and chug!

lo:

-2.631 - 1.970*0.130
## [1] -2.8871

hi:

-2.631 + 1.970*0.130
## [1] -2.3749

We’re 95% confident that the true population slope btwn displ and cty mpg is btwn -2.887 and -2.375 mpg per liter.

Ie, for each additional liter bigger a car’s engine is, we’d expect the cty mpg to decrease by between 2.375 and 2.887 mpg.
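R will also compute this interval directly — a nice check on the plug-and-chug (mpg comes from ggplot2):

```r
library(ggplot2)  # for the mpg data
confint(lm(cty ~ displ, data = mpg), level = 0.95)
# the displ row is roughly (-2.887, -2.374)
```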

Chi Square Test

Idea: do two (or more) different distributions match up?

Ex). A cognitive psychologist wonders: do babies prefer a particular color of toy?

To investigate, 100 babies are presented with 4 identical toys: one blue, one yellow, one green, one red.

Results: 35 babies pick red, 15 pick yellow, 20 pick blue, and 30 pick green.

Q: is there evidence for a preference of one color? Q: is there evidence that not all colors equally likely?

If all colors are equally preferred, how many would we expect for each color? 100/4 = 25!

Test:

1). Hypotheses

H0: all colors equally likely (p = 1/4 each) Ha: at least one color’s probability differs

2). Test stat

chisq = sum of (observed - expected)^2/expected = (35-25)^2/25 + (15 - 25)^2/25 + (20-25)^2/25 + (30-25)^2/25

(35-25)^2/25 + (15 - 25)^2/25 + (20-25)^2/25 + (30-25)^2/25
## [1] 10

chisq = 10.000

df = #categories - 1 = 4 - 1 = 3

3). p-val

1-pchisq(10.000, 3)
## [1] 0.01856614

p-val = .019

If it’s true that all colors are equally likely to be chosen, then there’s only a 1.9% chance of observing sample counts at least as uneven as ours.

4). Conclusion

Since p-val < .05, we reject H0. Found strong evidence that not all colors equally likely to be chosen.
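R has this whole goodness-of-fit test built in; equal probabilities are the default null:

```r
# observed counts: red, yellow, blue, green
chisq.test(c(35, 15, 20, 30))
# X-squared = 10, df = 3, p-value = 0.0186 -- same as the hand calculation
```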

Friday April 21

table() command and 2-prop test

Q: is there statistically sig evidence of a difference in proportion of cars that have 6 cyl btwn compact and midsize cars?

In R: table command.

addmargins(   table(mpg$cyl, mpg$class)   )
##      
##       2seater compact midsize minivan pickup subcompact suv Sum
##   4         0      32      16       1      3         21   8  81
##   5         0       2       0       0      0          2   0   4
##   6         0      13      23      10     10          7  16  79
##   8         5       0       2       0     20          5  38  70
##   Sum       5      47      41      11     33         35  62 234
1). Hypotheses

H0: p1 - p2 = 0 Ha: p1 - p2 != 0

2). Test stat

pooled proportion: (13 + 23)/(47 + 41) = 36/88

(13/47-23/41)/sqrt(36/88*(1-36/88)/47 + 36/88*(1-36/88)/41)
## [1] -2.706625
3). p-val
2*pnorm(-2.707)
## [1] 0.006789426

4). Conclusion

Since p-val < .05, we reject H0. We’ve found strong evidence of a difference in proportion of 6-cyl cars btwn compact and midsize cars.
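The built-in prop.test does the same 2-proportion test (turn off the continuity correction to match the z calculation above):

```r
# 13 of 47 compacts vs 23 of 41 midsizes have 6 cylinders
prop.test(c(13, 23), c(47, 41), correct = FALSE)
# p-value about 0.0068, matching 2*pnorm(-2.707)
```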

“Stacked” bar graph

  1. make table
  2. make bar graph

Ex). bar graph of cyl by class.

myTable <- table(mpg$cyl, mpg$class)
barplot(myTable, beside=T)  # beside=T gives side-by-side bars; omit it (the default) for a truly stacked bar graph

Week 14

Mon April 24

ANOVA

ANOVA = “analysis of variance”

Idea: Is there evidence of a difference in means (MORE THAN 2 GROUPS!)

Ex). Iris. Is there evidence of a difference in mean sepal length between the 3 species: setosa, versicolor, virginica?

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Visualization: boxplot

boxplot(iris$Sepal.Length ~ iris$Species, horizontal = T)

1). Hypotheses

H0: means are equal (no difference) Ha: at least one differs

2). Test stat

Idea: F = (variation between groups) / (variation WITHIN groups)

compute in r:

irisAnovaModel <- aov(iris$Sepal.Length ~ iris$Species)
summary(irisAnovaModel)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## iris$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals    147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test stat = F = 119.3
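The F value is just the ratio of the two Mean Sq entries in the ANOVA table above:

```r
# mean square between groups / mean square within groups
31.606 / 0.265
# about 119.3
```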

3). p-val

p-val ~ 0

4). Conclusion

Since p-val < .05, we reject H0. We’ve found strong evidence of a difference in mean sepal length between the three species.


Chi Square Test for independence

Idea: is there evidence of a relationship between two CATEGORICAL variables?

Ex). mpg

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
1). Hypotheses

H0: cyl and class are independent Ha: there’s some relationship btwn cyl and class

2). Test stat

Compute expected counts for all cyl/class combinations.

Idea: If A,B independent, then P(A and B) = P(A)*P(B)
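So each expected count is (row total) * (column total) / n. E.g., for 4-cyl compacts, using the margins from the addmargins table earlier:

```r
# expected 4-cyl compacts: 81 four-cyl cars total, 47 compacts total, 234 cars overall
81 * 47 / 234
# 16.269..., matching chisq.test's expected count for that cell
```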

In r: chisq.test()

chisq.test(mpg$cyl, mpg$class)
## 
##  Pearson's Chi-squared test
## 
## data:  mpg$cyl and mpg$class
## X-squared = 138.03, df = 18, p-value < 2.2e-16
ourModel <- chisq.test(mpg$cyl, mpg$class)
ourModel$observed
##        mpg$class
## mpg$cyl 2seater compact midsize minivan pickup subcompact suv
##       4       0      32      16       1      3         21   8
##       5       0       2       0       0      0          2   0
##       6       0      13      23      10     10          7  16
##       8       5       0       2       0     20          5  38
ourModel$expected
##        mpg$class
## mpg$cyl    2seater    compact    midsize   minivan     pickup subcompact
##       4 1.73076923 16.2692308 14.1923077 3.8076923 11.4230769 12.1153846
##       5 0.08547009  0.8034188  0.7008547 0.1880342  0.5641026  0.5982906
##       6 1.68803419 15.8675214 13.8418803 3.7136752 11.1410256 11.8162393
##       8 1.49572650 14.0598291 12.2649573 3.2905983  9.8717949 10.4700855
##        mpg$class
## mpg$cyl       suv
##       4 21.461538
##       5  1.059829
##       6 20.931624
##       8 18.547009
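One caution (an assumption check, not covered in class): the chi-square approximation wants all expected counts to be at least 5, and many cells here fall short.

```r
library(ggplot2)  # for the mpg data
ourModel <- chisq.test(mpg$cyl, mpg$class)  # R warns the approximation may be off
sum(ourModel$expected < 5)
# 13 of the 28 cells have expected counts below 5, so take the p-value with a grain of salt
```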

chisq = 138.03

3). p-val

p-val ~ 0

4). Conclusion

Since p-val < .05, we reject H0. There’s strong evidence of a relationship btwn cyl and class.

Visualization: stacked bar plot

  1. table
  2. barplot
mpgTable <- table(mpg$cyl, mpg$class)
mpgTable
##    
##     2seater compact midsize minivan pickup subcompact suv
##   4       0      32      16       1      3         21   8
##   5       0       2       0       0      0          2   0
##   6       0      13      23      10     10          7  16
##   8       5       0       2       0     20          5  38
barplot(mpgTable, beside=T, legend.text=rownames(mpgTable))  # rownames keeps the legend in row order (4, 5, 6, 8); omit beside=T for a truly stacked plot